Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: File row splits #1709

Merged
merged 26 commits into from
Dec 30, 2024
Merged

feat: File row splits #1709

merged 26 commits into from
Dec 30, 2024

Conversation

AdamGS
Copy link
Contributor

@AdamGS AdamGS commented Dec 18, 2024

Introduces a new VortexReadHandle that can be used cloned and used to either generate a full stream over the underlying file OR read a specific row range.
The handle also implements Clone so it can be re-used over the same file.

DataFusion support will be in a followup PR.

@AdamGS AdamGS requested a review from gatesn December 18, 2024 15:30
@AdamGS AdamGS marked this pull request as ready for review December 18, 2024 15:30
@AdamGS AdamGS changed the title [WIP] feat: File row splits feat: File row splits Dec 18, 2024
@@ -26,11 +26,11 @@ pub(crate) trait ReadMasked {

/// Read an array with a [`RowMask`].
pub(crate) struct ReadArray {
layout: Box<dyn LayoutReader>,
layout: Arc<dyn LayoutReader>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So LayoutReaders are no longer stateful?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like I overlooked some layers of of state/locks in the readers, that will take some thinking and probably additional refactoring.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok after a more careful reading and a small test, as far as I can tell while the LayoutReader implementations are stateful (with interior mutability), they can read independently from each other as the LayoutBufferedReader is the part that actually controls the overall flow and each stream initializes a new one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There can be multiple concurrent reads being tracked in the layout reader but are they reusable?

I thought something encoded assumptions that splits were accessed monotonically but maybe that was in the BufferedReader

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought so too, but that does seem to be in the BufferedLayoutReader. The LayoutReader implementation read through poll_read which takes a RowMask to read try and read a specific range on each call, and the BufferedReader tracks that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to access the layoutreaders out of order so we can read multiple at the same time irrespective of their children.

Copy link
Contributor

@gatesn gatesn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a step in the right direction. I still think we should pull apart BufferedReader thingy.

And are you sure all the state machines have gone from layout readers?

vortex-file/src/read/handle.rs Outdated Show resolved Hide resolved

// Set up a stream of RowMask that result from applying a filter expression over the file.
let mask_iterator = if let Some(fr) = &self.filter_reader {
Box::new(BufferedLayoutReader::new(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we pull the splits_stream out of this thing and just stream::iter(self.splits().iter()).map(|s| self.read_range(s).boxed()).buffered(64) or whatever?

Or is that not possible?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly worried about using buffered everywhere

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Buffered is a bit too naive. You want more control over coalescing

layout_reader: Arc<dyn LayoutReader>,
filter_reader: Option<Arc<dyn LayoutReader>>,
messages_cache: Arc<RwLock<LayoutMessageCache>>,
row_mask: Option<RowMask>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it necessary to provide a handle-level row-mask? This sort of feels like a parameter you pass when you build a scan from the handle

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The flow I imagine is something like:

  1. Read some external index to create initial RowMask
  2. Pass that row to the the builder to create a handle.
  3. Reuse the initial mask to create concurrent reads on subranges of the file

@gatesn gatesn enabled auto-merge (squash) December 30, 2024 17:05
@gatesn gatesn merged commit 1196436 into develop Dec 30, 2024
20 checks passed
@gatesn gatesn deleted the adamg/row-splits branch December 30, 2024 17:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants